1 Data

-Briefly describe your methods for gathering the data. -Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this. -Present a correlation matrix -Present 4 home price correlation scatterplots that you think are of interest. I’m going to look for interesting open data that you’ve integrated with the home sale observations. -Develop 1 map of your dependent variable (sale price) -Develop 3 maps of 3 of your most interesting independent variables. -Include any other maps/graphs/charts you think might be of interest.

1.1 Regression Summary Statistics

The summary statistics of the full model are presented below.

Dependent variable:
price
med_HH_Income -1.925***
(0.497)
pct.over75K 1,073,229.000***
(150,718.600)
pct.Information -2,203,228.000***
(360,495.200)
pct.Finance 1,594.924
(262,495.100)
pct.Professional -181,448.900
(152,270.900)
pct.Ed_Health -446,681.800***
(135,859.400)
nbrBedRoom 8,919.376*
(5,277.324)
nbrFullBaths -20,167.680***
(6,382.267)
TotalFinishedSF 159.369***
(8.851)
AcDscrEvaporative Cooler 52,784.020
(100,480.700)
AcDscrNo AC 53,806.290
(96,560.950)
AcDscrWhole House 63,817.470
(96,562.460)
Age 1,225.806***
(226.321)
schools_nn3 -32.421***
(4.985)
trailheads_nn5 -7.301
(5.761)
dist_FR -14.970***
(2.302)
qualityCodeDscrAVERAGE + -32,314.160*
(18,231.170)
qualityCodeDscrAVERAGE ++ 31,199.560*
(18,721.230)
qualityCodeDscrEXCELLENT 1,209,125.000***
(42,347.180)
qualityCodeDscrEXCELLENT + 1,501,140.000***
(94,512.870)
qualityCodeDscrEXCELLENT++ 2,053,472.000***
(74,958.720)
qualityCodeDscrEXCEPTIONAL 1 1,178,518.000***
(94,737.590)
qualityCodeDscrEXCEPTIONAL 2 1,989,543.000***
(250,755.900)
qualityCodeDscrFAIR -74,775.010
(45,893.580)
qualityCodeDscrGOOD 70,174.540***
(13,001.650)
qualityCodeDscrGOOD + 109,881.300***
(21,877.160)
qualityCodeDscrGOOD ++ 205,970.400***
(19,933.570)
qualityCodeDscrLOW -124,112.600
(97,682.940)
qualityCodeDscrVERY GOOD 300,628.500***
(21,468.180)
qualityCodeDscrVERY GOOD + 617,758.800***
(37,791.240)
qualityCodeDscrVERY GOOD ++ 681,368.300***
(30,880.430)
designCodeDscr2-3 Story -27,609.520**
(11,101.130)
designCodeDscrBi-level 49,911.570**
(25,385.950)
designCodeDscrMULTI STORY- TOWNHOUSE -132,319.300***
(15,917.080)
designCodeDscrSplit-level 19,440.710
(16,834.630)
ZipCode80025 -364,412.500
(283,330.700)
ZipCode80026 -283,686.300
(216,120.600)
ZipCode80027 -260,814.200
(216,935.800)
ZipCode80301 -163,691.500
(217,934.200)
ZipCode80302 156,545.000
(219,405.000)
ZipCode80303 -105,060.900
(218,506.400)
ZipCode80304 143,416.500
(219,907.900)
ZipCode80305 -79,906.970
(219,864.600)
ZipCode80403 -127,448.300
(227,619.200)
ZipCode80422 -37,943.930
(264,007.600)
ZipCode80455 -215,880.600
(232,920.400)
ZipCode80466 -224,580.500
(217,922.700)
ZipCode80471 -382,909.300
(490,554.700)
ZipCode80481 -65,293.540
(225,752.000)
ZipCode80501 -372,582.700*
(216,143.100)
ZipCode80503 -423,489.700*
(216,770.900)
ZipCode80504 -385,146.600*
(216,204.100)
ZipCode80510 290,204.600
(235,334.200)
ZipCode80516 -406,437.100*
(216,585.600)
ZipCode80540 -309,398.300
(222,118.100)
ZipCode80544 -393,245.800
(305,797.900)
Constant 861,197.800***
(245,352.800)
Observations 11,252
R2 0.518
Adjusted R2 0.516
Residual Std. Error 428,929.700 (df = 11195)
F Statistic 214.978*** (df = 56; 11195)
Note: p<0.1; p<0.05; p<0.01

1.2 Correlation Matrix

What is the use of a correlation matrix? What correlations do we see?

numericVars <- 
  select_if(st_drop_geometry(boulder.sf), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +   
    labs(title = "Correlation across numeric variables",
         caption = "Figure 1.1")

1.3 Variable Correlation Scatterplots

What are the factors? Why did we choose them? What trends do we see?

st_drop_geometry(boulder.sf) %>%
  dplyr::select(price, pct.over75K, pct.Professional, schools_nn3, trailheads_nn5) %>%
  filter(price <= 4000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables",
          caption = "Figure 1.2") +
     plotTheme()

1.4 Spatial Distribution of Home Sale Prices in Boulder County

# Price per square foot
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = boulder.sf, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder.sf,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Sale Prices, Boulder County",
       caption = "Figure 1.3") +
  mapTheme()

1.5 Mapping Independent Variables

# Distance from the Front Range
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = FrontRange, colour = "#2d6a4f", size = 2) +
  geom_sf(data = boulder_homes_observed, aes(colour = q5(dist_FR)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder_homes,"dist_FR"),
                   name="Quintile\nBreaks") +
  labs(title="Home distance from Front Range",
       caption = "Figure 1.4") +
  mapTheme()

# Median Household Income in Boulder County
ggplot() + 
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = acsTractsBoulder.2019.sf, aes(fill = med_HH_Income)) +
  scale_fill_viridis_b() +
  labs(title = "Median Household Income in Boulder County",
       subtitle = "by Census Tract",
       caption = "Figure 1.5") +
  mapTheme()

# Distance from the Front Range
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = boulder_schools, colour = "#2d6a4f") +
  labs(title = "Schools in Boulder County",
       caption = "Figure 1.6") +
  mapTheme()

# Sale Price + zipcodes
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = acsTractsBoulder.2019.sf, fill = NA, colour = "#55286F") +
  geom_sf(data = boulder_homes_observed, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder_homes,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Sale Prices + Zip Code Areas",
       caption = "Figure 1.7") +
  mapTheme()

2 Methods

Briefly describe your method (remember who your audience is).

3 Results

The following summary table presents the results of the linear regression on the training data set.

Dependent variable:
price
med_HH_Income -2.076***
(0.605)
pct.over75K 1,136,749.000***
(183,606.500)
pct.Information -2,517,058.000***
(435,785.800)
pct.Finance 66,818.700
(317,737.900)
pct.Professional -156,801.300
(186,024.500)
pct.Ed_Health -471,132.200***
(164,677.900)
nbrBedRoom 8,849.475
(6,480.865)
nbrFullBaths -19,258.620**
(7,773.679)
TotalFinishedSF 160.101***
(10.639)
AcDscrEvaporative Cooler 63,043.200
(108,443.400)
AcDscrNo AC 63,860.970
(103,894.300)
AcDscrWhole House 60,861.390
(103,894.800)
Age 1,109.136***
(273.193)
schools_nn3 -28.477***
(5.975)
trailheads_nn5 -7.063
(7.010)
dist_FR -14.736***
(2.792)
qualityCodeDscrAVERAGE + -34,740.310
(21,886.280)
qualityCodeDscrAVERAGE ++ 21,304.040
(22,337.710)
qualityCodeDscrEXCELLENT 1,142,633.000***
(50,293.660)
qualityCodeDscrEXCELLENT + 1,317,237.000***
(109,566.100)
qualityCodeDscrEXCELLENT++ 1,945,698.000***
(85,852.900)
qualityCodeDscrEXCEPTIONAL 1 1,150,923.000***
(107,404.700)
qualityCodeDscrEXCEPTIONAL 2 1,967,271.000***
(270,178.200)
qualityCodeDscrFAIR -78,528.760
(53,468.100)
qualityCodeDscrGOOD 62,221.240***
(15,838.690)
qualityCodeDscrGOOD + 102,477.700***
(26,319.790)
qualityCodeDscrGOOD ++ 210,331.600***
(24,042.850)
qualityCodeDscrLOW -132,648.500
(110,725.600)
qualityCodeDscrVERY GOOD 281,883.400***
(26,001.690)
qualityCodeDscrVERY GOOD + 610,056.000***
(44,912.940)
qualityCodeDscrVERY GOOD ++ 669,955.300***
(36,930.920)
designCodeDscr2-3 Story -26,532.340**
(13,465.430)
designCodeDscrBi-level 43,376.240
(29,920.470)
designCodeDscrMULTI STORY- TOWNHOUSE -130,140.200***
(19,321.290)
designCodeDscrSplit-level 14,541.880
(20,117.970)
ZipCode80025 -351,139.700
(306,081.300)
ZipCode80026 -270,869.000
(232,581.300)
ZipCode80027 -237,625.900
(233,677.500)
ZipCode80301 -129,292.700
(234,967.900)
ZipCode80302 149,099.200
(237,038.500)
ZipCode80303 -76,438.240
(235,781.900)
ZipCode80304 160,514.300
(237,727.400)
ZipCode80305 -54,384.380
(237,682.100)
ZipCode80403 -175,517.900
(247,367.600)
ZipCode80422 -55,758.940
(290,588.400)
ZipCode80455 -225,056.900
(252,463.100)
ZipCode80466 -233,085.200
(234,951.300)
ZipCode80471 -398,817.400
(528,420.400)
ZipCode80481 -90,260.160
(244,624.900)
ZipCode80501 -359,483.600
(232,634.800)
ZipCode80503 -413,351.300*
(233,471.700)
ZipCode80504 -372,423.700
(232,691.700)
ZipCode80510 231,214.800
(257,227.700)
ZipCode80516 -399,772.700*
(233,209.900)
ZipCode80540 -310,717.100
(240,450.600)
ZipCode80544 -402,348.900
(329,051.900)
Constant 851,203.400***
(266,642.400)
Observations 8,793
R2 0.485
Adjusted R2 0.482
Residual Std. Error 460,683.200 (df = 8736)
F Statistic 147.074*** (df = 56; 8736)
Note: p<0.1; p<0.05; p<0.01

3.1 Test Set Results

A summary of the mean absolute error and mean average percent error (MAPE) for the price prediction on the test data set is shown below.

Graphic representations of the results of the test set prediction are shown below.

# histogram of absolute errors
ggplot(boulder.test, aes(x = price.abserror)) +
  geom_histogram(binwidth=10000, fill = "green", colour = "white") +
  scale_x_continuous(limits = c(0, 1000000)) +
  labs(title = "Distribution of prediction errors for single test",
       x = "Sale Price Absolute Error", y = "Count") +
  plotTheme()

3.2 Cross Validation Results

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(boulder.sf), 
     method = "lm", trControl = fitControl, na.action = na.pass)

K-fold cross validation with 100 folds is used to explore the generalizability of this model. A histogram of the mean average error across the 100 folds is shown below.

# histogram of cross validation MAE
mae <- data.frame(reg.cv$resample[,3]) %>%
  rename(mae = reg.cv.resample...3.)

ggplot(mae, aes(x = mae)) +
  geom_histogram(binwidth=10000, fill = "orange", colour = "white") +
  scale_x_continuous(labels = c(0, 100000, 200000, 300000, 400000, 500000), 
                     limits = c(0, 500000)) +
  labs(title = "Distribution of MAE",
       subtitle = "k-fold cross validation; k = 100",
       x = "Mean Absolute Error", y = "Count") +
  plotTheme()

The prices predicted for the test set are plotted against the actual sale prices for the test set in the figure below.

ggplot(boulder.test) +
  geom_point(aes(price, price.predict)) +
  geom_smooth(aes(price, price), colour = "orange") +
  geom_smooth(method = "lm", aes(price, price.predict), se = FALSE, colour = "green") +
  labs(title = "Predicted sale price as a function of observed price",
       subtitle = "Orange line represents a perfect prediction; Green line represents prediction",
       x = "Observed Sale Price", y = "Predicted Sale Price") +
  plotTheme()

Map of residual absolute errors for the test set

ggplot() +
  geom_sf(data = boulder_boundary, fill = "grey") +
  geom_sf(data = boulder.test, aes(colour = q5(price.abserror)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder.test,"price.abserror"),
                   name="Quintile\nBreaks") +
  labs(title="Test set absolute price errors",
       caption = "Figure X.X") +
  mapTheme()

Plot of the spatial lag in errors for the test set

coords.test <- st_coordinates(boulder.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

boulder.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.error)) %>%
  ggplot(aes(lagPriceError, price.error)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     labs(title = "Error as a function of the spatial lag of price errors") +
     plotTheme()

Plot of the Moran’s I test on the test set

moranTest <- moran.mc(boulder.test$price.error,
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

allPredictions <- boulder_homes %>%
  mutate(predictions = predict(reg1, boulder_homes)) %>%
  dplyr::select(predictions)

ggplot() +
  geom_sf(data = boulder_boundary, fill = "grey") +
  geom_sf(data = allPredictions, aes(colour = q5(predictions)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(allPredictions,"predictions"),
                   name="Quintile\nBreaks") +
  labs(title="Predictions for all homes in the dataset, Boulder County",
       caption = "Figure X.X") +
  mapTheme()

There is something so strange going on in this one zip code.

st_drop_geometry(boulder.test) %>%
  group_by(ZipCode) %>%
  summarize(mean.MAPE = mean(price.ape, na.rm = T)) %>%
  ungroup() %>% 
  left_join(boulder_zips) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = boulder.test, colour = "black", size = .5) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by Zip Code") +
      mapTheme()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "ZipCode"

We need the scatterplot of MAPE by zip of mean price by zip

testError_by_zips <-
left_join(
  st_drop_geometry(boulder.test) %>%
    group_by(ZipCode) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
  st_drop_geometry(boulder.test) %>%
    group_by(ZipCode) %>%
    summarize(MAPE = mean(price.ape)))
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "ZipCode"
testError_by_zips %>%
  kable() %>% kable_styling()
ZipCode meanPrice MAPE
80026 628003.2 0.1814450
80027 702906.2 0.1624439
80301 841909.6 0.2166523
80302 1330729.5 0.3098138
80303 895975.3 0.1687126
80304 1422538.7 0.2876116
80305 883377.9 0.1771646
80403 436466.7 0.2104327
80422 785000.0 0.0348504
80455 497500.0 0.0499648
80466 552641.7 0.2866676
80481 499000.0 0.3135704
80501 425959.9 0.2012125
80503 698298.5 0.1868987
80504 510186.7 0.1960928
80510 294200.0 0.3622939
80516 617402.3 0.2375731
80540 491155.0 -0.3172851
ggplot(testError_by_zips) +
  geom_point(aes(meanPrice, MAPE)) +
  geom_smooth(method = "lm", aes(meanPrice, MAPE), se = FALSE, colour = "green") +
  labs(title = "MAPE by Zip Code as a function of mean price by Zip Code",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme()

3.3 Generalizability

Is the model generalizable?

boulder_tracts19 <- 
  get_acs(geography = "tract", year = 2019, 
          variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          geometry = TRUE, state = "CO", county = "Boulder", output = "wide") %>%
  st_transform('ESRI:102254')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(boulder_tracts19), 
  aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(boulder_tracts19), 
  aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), 
    name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + 
    theme(legend.position="bottom"))

4 Discussion

Is this an effective model? What were some of the more interesting variables? How much of the variation in prices could you predict? Describe the more important features? Describe the error in your predictions? According to your maps, could you account the spatial variation in prices? Where did the model predict particularly well? Poorly? Why do you think this might be?

5 Conclusion

Would you recommend your model to Zillow? Why or why not? How might you improve this model?